Probability concepts

Example: We want to make statements about winter days and snow days. This means that there are 4 possible event types. We can visualize the probability that any given day falls into one of these four classes as follows.

Marginal probabilities are just the probabilities of an event type, independent of anything else.

The joint probability can be visualized as the intersection of both event types winter and snow:

Finally, the conditional probability fixes one event type to a specific value, and asks “Given we fixed this first event type, what is the probability of the second event type?” We are focusing only on this subset of all possible event types:

The conditional probability is then just the joint probability relative to (divided by) the probability of the first, fixed event type.

If we use simulations to calculate these probabilities, as illustrated in the next figure, the problem is reduced to simple counts:

Do you recognize the product rule here?

\[ P(snow|winter) = \frac{\text{number of snowy winter days}}{\text{number of winter days}} \]
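Reduced to counts, this calculation can be sketched in a small simulation. The probabilities of winter and snow used here are illustrative assumptions, not values taken from the figure:

```r
set.seed(1)
n_days = 1e5
# each day is a winter day with probability 1/4 (illustrative)
winter = rbinom(n_days, 1, 0.25) == 1
# snow is much more likely on winter days (illustrative probabilities)
snow = rbinom(n_days, 1, ifelse(winter, 0.30, 0.01)) == 1

# conditional probability as a ratio of simple counts
p_cond = sum(snow & winter) / sum(winter)
# the same number as joint probability divided by marginal probability
p_joint = mean(snow & winter)
p_marg  = mean(winter)
p_joint / p_marg   # equals p_cond
```
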

Here is the whole thing with A and B instead of snow and winter.

\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} = \frac{\overset{\color{red}{\text{joint probability}}}{P(A, B)}}{\overset{\color{blue}{\text{marginal probability}}}{P(B)}} \] and by multiplying both sides with \(P(B)\), we get

This is the product rule, which links joint and conditional probabilities.

\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} = \overset{\color{red}{\text{joint probability}}}{P(A,B)} \]

which is the same as

\[ \overset{\color{red}{\text{joint probability}}}{P(A,B)} = \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} \]

This is the general product rule (or chain rule) that connects conditional probabilities with joint probabilities.

Deriving Bayes rule

\[ P(A,B) = \color{blue}{P(A|B) \cdot P(B)} \\ P(A,B) = \color{red}{P(B|A) \cdot P(A)} \]

Because both equations have \(P(A,B)\) on the left hand side, we can also write

\[ \color{blue}{P(A|B) \cdot P(B)} = \color{red}{P(B|A) \cdot P(A)} \]

If we want to know what \(P(A|B)\) is, we now have to divide both sides by \(P(B)\), which gives us

\[ \color{blue}{P(A|B)} = \frac{\color{red}{P(B|A) \cdot P(A)}}{\color{blue}{P(B)}} \]

This is Bayes rule, which one uses to calculate the inverse conditional probability, i.e. when we have information about the probability of \(B\) given \(A\) (\(P(B|A)\)) and want to calculate the probability of \(A\) given \(B\) (\(P(A|B)\)).

Chapter 3 also introduces a diagnostic example. More generally, we can ask what the probability of a disease is given a positive test result, \(P(D|T+)\).

\[ P(D|T\textrm{+}) = \frac{P(T\textrm{+}|D)\cdot P(D)}{P(T\textrm{+})} \]

This tells us that the probability of disease also depends on the unconditional probabilities (base rates) of the disease and of positive tests.

Sometimes, Bayes rule is expressed in a more complicated manner, if one adds how to calculate \(P(T+)\): \[ P(D|T\textrm{+}) = \frac{P(T\textrm{+}|D)\cdot P(D)} {P(T\textrm{+}|\text{not } D)\cdot P(\text{not } D) + P(T\textrm{+}|D)\cdot P(D)} \] However, this is not so relevant for Bayesian inference.
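With natural frequencies, Bayes rule reduces to a simple count ratio. As a sketch, using the counts per 1000 people that also appear in the pie chart below (1 false negative, 9 true positives, 89 false positives, 901 true negatives):

```r
# natural frequencies per 1000 people
TP = 9; FN = 1; FP = 89; TN = 901
n = TP + FN + FP + TN             # 1000 people

p_D      = (TP + FN) / n          # base rate P(D) = 0.01
p_Tpos_D = TP / (TP + FN)         # sensitivity P(T+|D) = 0.9
p_Tpos   = (TP + FP) / n          # P(T+) = 0.098

# Bayes rule ...
p_D_Tpos = p_Tpos_D * p_D / p_Tpos
# ... gives the same result as the simple count ratio
TP / (TP + FP)                    # about 0.092
```
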

Because this last expression is so complicated, natural frequencies are popular with some people.

Visualizing the frequencies in a pie chart also helps to understand the issue:

set.seed(123)
par(mar = c(0,0,0,0))
cols = c("violet","red","orange","green4")
pie(c(1,9,89,901), labels = "", border = NA, col = cols )

legend("topleft",
       bty = "n",
       fill = cols,
       legend = c(
         "False negative",
         "True positive",
         "False positive",
         "True negative"
       ))

What are probability distributions, and why do we need them for Bayesian statistics?

Probability distributions are a special kind of function. Here are some functions you have probably already encountered.

In Bayesian statistics, we use such distributions to express three things:

  1. Our prior judgement about the probability of different parameter values before seeing the data.
  2. The probability of the data given different parameter values. This is also called the likelihood.
  3. The posterior probability of different parameter values given our prior judgement and the data.

Let’s walk through a simple example. We start by describing our prior judgement, that we are slightly more confident that the index finger touches water rather than land, with the beta distribution.

What is the posterior probability of water, given 10 W and 3 L tosses?

First we define a grid and plot the prior “probability” values:

p_grid = seq(0,1,by = .05)
prior = dbeta(p_grid,2,1)
plot(p_grid, prior, type = "h", col = "blue",
     ylab = "density", main = "Prior")

Vertical lines indicate the prior plausibility for parameter values in the grid.

Next we calculate the likelihood, i.e. the probability of the data given the model (a binomial distribution), the data (10 W, 3 L) and the parameter p (in p_grid):

likelihood = dbinom(10,13,p_grid)
plot(p_grid, likelihood, type = "h", col = "red",
     ylab = "density", main = "Likelihood")

Vertical lines indicate the plausibility of the data for the parameter values in the grid.

And now we can calculate the un-normalized posterior as a product of prior and likelihood:

posterior = prior * likelihood
plot(p_grid, posterior, type = "h", col = "violet", 
     ylab = "density", main = "Posterior")

Vertical lines indicate the posterior plausibility of the parameter values in the grid, given the data and the prior.

Here is a plot with all three together:

par(mar = c(5.1,4.1,1,2.1))
ylim = c(0,max(c(likelihood,posterior,prior)))
plot(p_grid, prior, type = "h",col = "blue", 
     ylab = "density", ylim = ylim)
lines(p_grid+.01, likelihood, type = "h", col = "red")
lines(p_grid-.01, posterior, type = "h", col = "violet")
legend("topleft",col = c("blue","red","violet"),
       lty = 1, legend = c("Prior","Likelihood","Posterior"), 
       bty = "n")

We can make the plot easier to view by normalizing all values so that they sum up to 1 for prior, likelihood and posterior. In each distribution, only the relative values at the different points of the grid are relevant!

n_prior = prior/sum(prior)
n_likelihood = likelihood/sum(likelihood)
n_posterior = posterior/sum(posterior)

Above, we have information about the prior only at relatively few grid points. Let's increase the resolution of the grid:

fp_grid = seq(0,1,by = .02)
f_prior = dbeta(fp_grid,2,1)
f_likelihood = dbinom(10,13,fp_grid)
f_posterior = f_prior * f_likelihood

f_prior = f_prior/sum(f_prior)
f_posterior = f_posterior/sum(f_posterior)
f_likelihood = f_likelihood/sum(f_likelihood)
f_ylim = c(0,max(c(f_likelihood,f_posterior,f_prior)))

par(mar = c(5.1,4.1,1,2.1))
plot(fp_grid, f_prior, type = "h", col = "blue",
     ylab = "normalized density", ylim = f_ylim)
lines(fp_grid+.005, f_likelihood, type = "h",col = "red")
lines(fp_grid-.005, f_posterior, type = "h", col = "violet")
legend("topleft",col = c("blue","red","violet"),
       lty = 1, legend = c("Prior","Likelihood","Posterior"), 
       bty = "n")

This display quickly gets hard to read. Because we also know that the plausibility of a parameter value is similar to the plausibility of adjacent parameter values, we can draw an outline instead of many vertical lines:

fp_grid = seq(0,1,by = .01)
f_prior = dbeta(fp_grid,2,1)
f_likelihood = dbinom(10,13,fp_grid)
f_posterior = f_prior * f_likelihood

f_prior = f_prior/sum(f_prior)
f_posterior = f_posterior/sum(f_posterior)
f_likelihood = f_likelihood/sum(f_likelihood)
f_ylim = c(0,max(c(f_likelihood,f_posterior,f_prior)))

par(mar = c(5.1,4.1,1,2.1))
plot(fp_grid, f_prior, type = "l", col = "blue",
     ylab = "normalized density", ylim = f_ylim)
lines(fp_grid, f_likelihood, type = "l",col = "red")
lines(fp_grid, f_posterior, type = "l", col = "violet")
legend("topleft",col = c("blue","red","violet"),
       lty = 1, legend = c("Prior","Likelihood","Posterior"), 
       bty = "n")

Ways to calculate the posterior

Analytically

For example, if we assume a priori that a finger is equally likely to land on land and water, we can say our prior is a Beta distribution with parameters \(\alpha\) (successes) = 1 and \(\beta\) (failures) = 1: \(\small Beta(1,1)\). If we then observe \(n =\) 10 tosses, where the finger lands on water for \(x =\) 7 tosses, the posterior distribution is a Beta distribution with parameters \(\alpha\) = 1 + 7 = 8 and \(\beta\) = 1 + 3 = 4: \(\small Beta(8,4)\).

More generally, with priors \(\alpha\) and \(\beta\) \[ Posterior = Beta(\alpha + x, \beta+n-x) \]
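A quick numerical check of this conjugate result, comparing the analytic \(\small Beta(8,4)\) posterior with a brute-force grid computation:

```r
# analytic posterior for a flat Beta(1,1) prior and 7 water in 10 tosses
p_grid = seq(0, 1, by = .001)
analytic = dbeta(p_grid, 8, 4)

# brute force: prior times likelihood, rescaled to a density
grid_post = dbeta(p_grid, 1, 1) * dbinom(7, 10, p_grid)
grid_post = grid_post / sum(grid_post) / .001

max(abs(analytic - grid_post))   # very close to zero
```
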

However, this solution is only possible in the few situations where there exist conjugate priors, i.e., priors that are conjugate to the likelihood such that the posterior has the same distribution as the prior. For instance, the beta distribution is the conjugate prior of the parameter p of the binomial distribution. The normal distribution is a conjugate prior for the mean parameter of the normal distribution, and the Inverse Gamma distribution is the conjugate prior for the variance of the normal distribution.

Grid approximation

  • We calculate the posterior probability for a grid of parameter values.

This works if we have a limited number of parameters.

To get samples from this probability distribution, we sample from the grid of parameter values for which we calculated the posterior, using the posterior probabilities as sampling weights.
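A minimal sketch of both steps, using the 7-water-in-10-tosses data from the analytic example above:

```r
set.seed(1)
# grid approximation: flat prior, binomial likelihood
p_grid = seq(0, 1, by = .01)
prior = rep(1, length(p_grid))
likelihood = dbinom(7, 10, p_grid)
posterior = prior * likelihood
posterior = posterior / sum(posterior)

# to get samples, the posterior probabilities act as sampling weights
samples = sample(p_grid, 1e4, prob = posterior, replace = TRUE)
mean(samples)   # close to the analytic Beta(8,4) mean, 8/12
```
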

Quadratic (Laplace) approximation

  • We find the mode of the posterior distribution, which is also called the maximum a posteriori or MAP1
  • We use the curvature at this mode2 to approximate what the entire posterior distribution looks like.

This works if the posterior distribution is approximately (multivariate) normal, which also means that it has only one mode.

To get samples we simulate from the (multivariate) normal distribution that describes the posterior.
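The two steps above can be sketched with base R's optim(), again for 7 water in 10 tosses with a flat prior (a sketch, not how dedicated software implements it):

```r
# log posterior (up to a constant): flat prior, binomial likelihood
log_post = function(p) dbinom(7, 10, p, log = TRUE)

# step 1: find the mode (MAP) with an optimizer
fit = optim(par = 0.5, fn = function(p) -log_post(p),
            method = "Brent", lower = 1e-6, upper = 1 - 1e-6,
            hessian = TRUE)
map = fit$par                     # the analytic mode of Beta(8,4) is 0.7

# step 2: the curvature (Hessian) at the mode gives the width
# of the approximating normal distribution
sd_approx = sqrt(1 / fit$hessian[1, 1])

# to get samples, simulate from that normal distribution
samples = rnorm(1e4, mean = map, sd = sd_approx)
```
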

Markov Chain Monte Carlo (MCMC)

  • A simulation-based approach

Also works for non-normal and multi-modal posterior distributions, but needs more time.

MCMC directly produces samples.
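As a sketch of the core idea, here is a minimal random-walk Metropolis sampler for the same globe-tossing posterior (7 water in 10 tosses, flat prior); real MCMC software is far more sophisticated, but the basic loop looks like this:

```r
set.seed(42)
# un-normalized log posterior: flat prior, 7 water in 10 tosses
log_post = function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(7, 10, p, log = TRUE)
}

n_iter = 2e4
samples = numeric(n_iter)
current = 0.5
for (i in 1:n_iter) {
  proposal = current + rnorm(1, 0, 0.1)   # random-walk proposal
  # accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal) - log_post(current)) {
    current = proposal
  }
  samples[i] = current
}
mean(samples)   # close to the Beta(8,4) mean, 8/12
```
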

Learning things from the posterior

The posterior distribution contains all our knowledge given the prior, model, and data. If possible, we should show the reader the full posterior. If the posterior distribution has a simple shape, which is often the case, we can calculate certain statistics to summarize the posterior.

We walk through some concepts with the following posterior.

We can display posterior distributions in different ways:

samples1 = rbeta(2000,10,2.5)
par(mfrow = c(2,2), bty = "n", mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)
plot(1:length(samples1), samples1, main = "dot plot",
     ylab = "parameter value", xlab  = "sample #")
plot(1:length(samples1), samples1, type = "l", main = "trace plot",
     ylab = "parameter value", xlab  = "sample #")
colored.hist(samples1, length.out = 30)
plot(density(samples1), main = "density", xlim = c(0,1), xlab = "parameter value")

I prefer histograms over density plots, because density plots can distort things.

Mean, median, and mode
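Mean and median can be computed directly from posterior samples. Base R has no mode function for samples; a common stand-in is the peak of a kernel density estimate. A sketch, with the samples re-drawn here so the snippet runs on its own:

```r
set.seed(123)
samples1 = rbeta(2000, 10, 2.5)   # posterior samples, as used below

mean(samples1)                    # posterior mean
median(samples1)                  # posterior median
# mode: location of the peak of a kernel density estimate
d = density(samples1)
d$x[which.max(d$y)]
```
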

Quantiles

A quantile is the value below which x% of the posterior distribution falls.

Q05 = quantile(samples1,.05)
colored.hist(samples1, length.out = 60, upper = Q05)

Credible intervals

An x% credible interval (CI) is the central interval that contains x% of the posterior distribution.

Credible intervals are also called posterior intervals. This is what some people think confidence intervals are.

CI80 = PI(samples1, prob = .8)
colored.hist(samples1, length.out = 60,
             lower = CI80[1],
             upper = CI80[2])
title(paste("Interval length = ",round(diff(CI80),3)))
abline(v = CI80, lwd = 2, col  = "blue")
text(CI80, c(3000,3000), labels = round(CI80,2), pos = c(2,4), col = "blue")

Credible intervals are calculated with quantiles. For instance, the 80% credible interval goes from the 10% ((100-80)/2) quantile to the 90% quantile.
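The PI() helper from the rethinking package is essentially a call to quantile(); a base-R sketch of the same computation:

```r
set.seed(123)
samples1 = rbeta(2000, 10, 2.5)   # posterior samples as above

prob = 0.8
tail_prob = (1 - prob) / 2        # 0.1 in each tail
CI80 = quantile(samples1, c(tail_prob, 1 - tail_prob))
CI80                              # the 10% and 90% quantiles
```
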

Highest posterior density interval

An x% highest posterior density interval (HPDI) is the shortest interval that contains x% of the posterior distribution.

HPDI80 = HPDI(samples1, prob = .8)
colored.hist(samples1, length.out = 60, color1 = "violet",
             lower = HPDI80[1],
             upper = HPDI80[2])
title(paste("Interval length = ",round(diff(HPDI80),3)))
abline(v = HPDI80, col = "violet", lwd = 2)
abline(v = CI80, col = "blue", lwd = 2, lty = 3)
text(CI80, c(3000,3000), labels = round(CI80,2), pos = c(2,4), col = "blue")
text(HPDI80, c(2500,2500), labels = round(HPDI80,2), pos = c(2,4), col = "violet")

Posterior predictions

How do we know that our model is any good?

In the workflow of Bayesian data analysis, we check “model fit” visually, by examining whether the predictions made by the model are consistent with the data.

We do this by taking parameter values from the posterior distribution, and then simulating data from the model with these parameters.

If the model and parameters are good, the simulated data should look similar to the observed data.

Let's look at some more globe tosses: we observe water in 13 out of 20 tosses and calculate the posterior distribution for the proportion of water with a flat prior.

n_water = 13
n_total = 20
p_grid = seq(0,1,.01)
prior = rep(1,length(p_grid))
likelihood = dbinom(n_water, n_total, prob = p_grid)
posterior = prior*likelihood
posterior = posterior / sum(posterior)
plot(p_grid,posterior,'l',
     main = "Posterior distribution for the proportion of water")

To simulate data from our posterior distribution, we have to use the posterior probabilities of the parameter values. Let's first simulate 5000 data sets, i.e. numbers of water tosses, using the median of the posterior:

# generate samples from the posterior distribution
posterior_samples = 
  sample(p_grid,5000, prob = posterior, replace = TRUE)
# simulate data given the median of the 
# posterior distribution for proportion of water
predictions.point = 
  rbinom(5000,n_total, prob = median(posterior_samples))
simplehist(
  predictions.point, 
  xlab = "# predicted water tosses",
  main = "Posterior predictive distribution from median")
points(n_water,0,pch = 17, col = "green3", cex = 3)
points(round(mean(predictions.point)),0, col = "red", pch = 16, cex = 2)

The random process involved in simulating the data (each simulated data set consists of 20 tosses with a biased coin) means that our posterior predictions are variable even if we always use the same (median) probability.

However, in a Bayesian approach, we do not use a point estimate of parameters to generate posterior predictions, but the full posterior distribution.

predictions.dist = rbinom(5000,n_total, prob = posterior_samples)
simplehist(
  predictions.dist,
  ylim = c(0, max(table(predictions.point))),
  col = "blue",
  xlab = "# predicted water tosses",
  main = "Posterior predictive distribution")
lines(as.numeric(names(table(predictions.point)))+.2,
      table(predictions.point), type ="h", lwd = 3)
points(n_water,0,pch = 17, col = "green3", cex = 2.5)
points(round(mean(predictions.dist)),0, col = "red", pch = 16, cex = 2)
legend("topleft",
       bty = "n",
       col = c("blue","black"), lty = 1, lwd = 3,
       title = "Predictions from posterior",
       legend = c("distribution",
                  "median"))

The uncertainty described by the full posterior distribution is reflected in a somewhat wider posterior predictive distribution.

Now let's redo the analysis, but instead of using the binomial distribution we use another distribution for integer data, the Poisson distribution. The sole parameter of the Poisson distribution is the expected (mean) number of counts.

m_grid = seq(0,40,.1)
prior = rep(1,length(m_grid))
likelihood = dpois(n_water, lambda = m_grid)
posterior = prior*likelihood
posterior = posterior / sum(posterior)
plot(m_grid,posterior,'l',
     main = "Posterior distribution for expected water tosses.")

We can already see that something is not right here. Let's generate posterior predictions:

# generate samples from the posterior distribution
posterior_samples = sample(m_grid,5000, prob = posterior, replace = TRUE)
# simulate data given samples from the posterior
# distribution for the expected number of water tosses
predictions.dist = rpois(5000,posterior_samples)
simplehist(
  predictions.dist, xlab = "# predicted water tosses",
  main = "Posterior predictive distribution from 'wrong' model.")
points(n_water,0,pch = 17, col = "green3", cex = 2.5)
points(round(mean(predictions.dist)),0, col = "red", pch = 16, cex = 2)
abline(v = 20.5, lty = 3, col = "magenta")

These posterior predictions are clearly incompatible with the data, because they allow for impossible values of 21 or more water tosses, even though our sample size is only 20.

The lesson learned here should be that if we use a model that is not appropriate for the data, our statistical model can lead to impossible predictions, which in turn might lead to wrong conclusions. One of the more popular ways to use a “wrong model” is using linear regression for binary data when the prevalence of the outcome is either very low (<5%) or very high (>95%).

Here is a posterior predictive distribution from another bad model. What aspect of the model is problematic?

n_water = 13
n_total = 20
p_grid = seq(0,1,.01)
prior = dbeta(p_grid,1,10)
likelihood = dbinom(n_water, n_total, prob = p_grid)
posterior = prior*likelihood
posterior = posterior / sum(posterior)
posterior_samples = 
  sample(p_grid,5000, prob = posterior, replace = TRUE)
predictions.dist = rbinom(5000,n_total, prob = posterior_samples)
simplehist(
  predictions.dist,
  xlab = "# predicted water tosses",
  main = "Bad posterior predictive distribution (I)")
points(n_water,0,pch = 17, col = "green3", cex = 2.5)
points(round(mean(predictions.dist)),0, col = "red", pch = 16, cex = 2)

Summary


  1. with some optimization scheme↩︎

  2. How quickly the posterior changes if one moves away from the mode↩︎